network size
LimitstoDepth-EfficienciesofSelf-Attention
Self-attention architectures, which are rapidly pushing the frontier innatural language processing, demonstrate asurprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) isjust as useful as increasing the number of self-attention layers (network depth).
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Asia > Russia (0.04)
- Africa > Ethiopia (0.04)
On the Power and Limitations of Random Features for Understanding Neural Networks
Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these \emph{explicitly} leads to the well-known approach of learning with random features (e.g.
Symplectic Adjoint Method for Exact Gradient of Neural ODE with Minimal Memory
A neural network model of a differential equation, namely neural ODE, has enabled the learning of continuous-time dynamical systems and probabilistic distributions with high accuracy. The neural ODE uses the same network repeatedly during a numerical integration. The memory consumption of the backpropagation algorithm is proportional to the number of uses times the network size. This is true even if a checkpointing scheme divides the computation graph into sub-graphs.